Keras: Deep Learning library for Theano and TensorFlow

Keras is a minimalist, highly modular neural networks library, written in Python and capable of running on top of either TensorFlow or Theano.

It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research. ref: https://keras.io/

Kaggle Challenge Data

The Otto Group is one of the world’s biggest e-commerce companies. A consistent analysis of the performance of products is crucial. However, due to diverse global infrastructure, many identical products get classified differently. For this competition, we have provided a dataset with 93 features for more than 200,000 products. The objective is to build a predictive model which is able to distinguish between our main product categories. Each row corresponds to a single product. There are a total of 93 numerical features, which represent counts of different events. All features have been obfuscated and will not be defined any further.

https://www.kaggle.com/c/otto-group-product-classification-challenge/data

For this section we will use the Kaggle Otto Group Challenge data, which you will find in the ../data/kaggle_ottogroup/ folder.

Logistic Regression

Despite its name, this algorithm has nothing to do with canonical linear regression: it is an algorithm for solving classification problems (supervised learning).

In fact, to estimate the dependent variable, we now make use of the so-called logistic function, or sigmoid.

It is precisely because of this function that the algorithm is called logistic regression.
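Concretely, the sigmoid squashes any real number into $(0, 1)$, and logistic regression applies it to a linear combination of the input features (this is exactly the expression built with Theano below):

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad P(y = 1 \mid x) = \sigma(w^\top x + b)$$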

Data Preparation


In [1]:
from kaggle_data import load_data, preprocess_data, preprocess_labels
import numpy as np
import matplotlib.pyplot as plt


Using TensorFlow backend.

In [2]:
X_train, labels = load_data('../data/kaggle_ottogroup/train.csv', train=True)
X_train, scaler = preprocess_data(X_train)
Y_train, encoder = preprocess_labels(labels)

X_test, ids = load_data('../data/kaggle_ottogroup/test.csv', train=False)
X_test, _ = preprocess_data(X_test, scaler)

nb_classes = Y_train.shape[1]
print(nb_classes, 'classes')

dims = X_train.shape[1]
print(dims, 'dims')


9 classes
93 dims

In [3]:
np.unique(labels)


Out[3]:
array(['Class_1', 'Class_2', 'Class_3', 'Class_4', 'Class_5', 'Class_6',
       'Class_7', 'Class_8', 'Class_9'], dtype=object)

In [4]:
Y_train  # one-hot encoding


Out[4]:
array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  1.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  1., ...,  0.,  0.,  0.]])

Using Theano


In [5]:
import theano as th
import theano.tensor as T

In [6]:
#Based on example from DeepLearning.net
rng = np.random
N = 400
feats = 93
training_steps = 10

# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = th.shared(rng.randn(feats), name="w")
b = th.shared(0., name="b")

# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))         # Probability that target = 1
prediction = p_1 > 0.5                          # The prediction thresholded
xent = -y * T.log(p_1) - (1-y) * T.log(1-p_1)   # Cross-entropy loss function
cost = xent.mean() + 0.01 * (w ** 2).sum()      # The cost to minimize
gw, gb = T.grad(cost, [w, b])                   # Compute the gradient of the cost
                                                

# Compile
train = th.function(
          inputs=[x,y],
          outputs=[prediction, xent],
          updates=((w, w - 0.1 * gw), (b, b - 0.1 * gb)),
          allow_input_downcast=True)
predict = th.function(inputs=[x], outputs=prediction, allow_input_downcast=True)

# Extract the Class_1 column of the one-hot targets as a binary (one-vs-rest) label vector
y_class1 = Y_train[:, 0]

# Train
for i in range(training_steps):
    print('Epoch %s' % (i+1,))
    pred, err = train(X_train, y_class1)

print("target values for Data:")
print(y_class1)
print("prediction on training set:")
print(predict(X_train))


Epoch 1
Epoch 2
Epoch 3
Epoch 4
Epoch 5
Epoch 6
Epoch 7
Epoch 8
Epoch 9
Epoch 10
target values for Data:
[ 0.  0.  0. ...,  0.  0.  0.]
prediction on training set:
[ True  True False ...,  True  True  True]

Using TensorFlow


In [7]:
import tensorflow as tf

In [8]:
# Parameters
learning_rate = 0.01
training_epochs = 25
display_step = 1

In [9]:
# tf Graph Input
x = tf.placeholder("float", [None, dims]) 
y = tf.placeholder("float", [None, nb_classes])

In [10]:
x


Out[10]:
<tf.Tensor 'Placeholder:0' shape=(?, 93) dtype=float32>

Model (Introducing TensorBoard)


In [11]:
# Construct (linear) model
with tf.name_scope("model") as scope:
    # Set model weights
    W = tf.Variable(tf.zeros([dims, nb_classes]))
    b = tf.Variable(tf.zeros([nb_classes]))
    activation = tf.nn.softmax(tf.matmul(x, W) + b) # Softmax

    # Add summary ops to collect data
    w_h = tf.summary.histogram("weights_histogram", W)
    b_h = tf.summary.histogram("biases_histograms", b)
    tf.summary.scalar('mean_weights', tf.reduce_mean(W))
    tf.summary.scalar('mean_bias', tf.reduce_mean(b))

# Minimize error using cross entropy
# Note: More name scopes will clean up graph representation
with tf.name_scope("cost_function") as scope:
    cross_entropy = y*tf.log(activation)
    cost = tf.reduce_mean(-tf.reduce_sum(cross_entropy,reduction_indices=1))
    # Create a summary to monitor the cost function
    tf.summary.scalar("cost_function", cost)
    tf.summary.histogram("cost_histogram", cost)

with tf.name_scope("train") as scope:
    # Set the Optimizer
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

Accuracy


In [12]:
with tf.name_scope('Accuracy') as scope:
    correct_prediction = tf.equal(tf.argmax(activation, 1), tf.argmax(y, 1))
    # Calculate accuracy
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
    # Create a summary to monitor the cost function
    tf.summary.scalar("accuracy", accuracy)

Learning in a TF Session


In [13]:
LOGDIR = "/tmp/logistic_logs"
import os, shutil
if os.path.isdir(LOGDIR):
    shutil.rmtree(LOGDIR)
os.mkdir(LOGDIR)

# Plug TensorBoard Visualisation 
writer = tf.summary.FileWriter(LOGDIR, graph=tf.get_default_graph())

In [14]:
for var in tf.get_collection(tf.GraphKeys.SUMMARIES):
    print(var.name)
    
summary_op = tf.summary.merge_all()
print('Summary Op: ' + summary_op)  # string + Tensor builds a new string Tensor (printed as 'add:0' below)


model/weights_histogram:0
model/biases_histograms:0
model/mean_weights:0
model/mean_bias:0
cost_function/cost_function:0
cost_function/cost_histogram:0
Accuracy/accuracy:0
Tensor("add:0", shape=(), dtype=string)

In [15]:
# Launch the graph
with tf.Session() as session:
    # Initializing the variables
    session.run(tf.global_variables_initializer())
    
    cost_epochs = []
    # Training cycle
    for epoch in range(training_epochs):
        _, summary, c = session.run(fetches=[optimizer, summary_op, cost], 
                                    feed_dict={x: X_train, y: Y_train})
        cost_epochs.append(c)
        writer.add_summary(summary=summary, global_step=epoch)
        print("accuracy epoch {}:{}".format(epoch, accuracy.eval({x: X_train, y: Y_train})))
        
    print("Training phase finished")
    
    #plotting
    plt.plot(range(len(cost_epochs)), cost_epochs, 'o', label='Logistic Regression Training phase')
    plt.ylabel('cost')
    plt.xlabel('epoch')
    plt.legend()
    plt.show()
    
    prediction = tf.argmax(activation, 1)
    print(prediction.eval({x: X_test}))


accuracy epoch 0:0.6649535894393921
accuracy epoch 1:0.665276825428009
accuracy epoch 2:0.6657131910324097
accuracy epoch 3:0.6659556031227112
accuracy epoch 4:0.6662949919700623
accuracy epoch 5:0.6666181683540344
accuracy epoch 6:0.6668121218681335
accuracy epoch 7:0.6671029925346375
accuracy epoch 8:0.6674585342407227
accuracy epoch 9:0.6678463816642761
accuracy epoch 10:0.6680726408958435
accuracy epoch 11:0.6682504415512085
accuracy epoch 12:0.6684605479240417
accuracy epoch 13:0.6687514185905457
accuracy epoch 14:0.6690422892570496
accuracy epoch 15:0.6692523956298828
accuracy epoch 16:0.6695109605789185
accuracy epoch 17:0.6697695255279541
accuracy epoch 18:0.6699796319007874
accuracy epoch 19:0.6702220439910889
accuracy epoch 20:0.6705452799797058
accuracy epoch 21:0.6708361506462097
accuracy epoch 22:0.6710785627365112
accuracy epoch 23:0.671385645866394
accuracy epoch 24:0.6716926693916321
Training phase finished
[1 5 5 ..., 2 1 1]

In [16]:
%%bash
python -m tensorflow.tensorboard --logdir=/tmp/logistic_logs


Process is terminated.

Using Keras


In [17]:
from keras.models import Sequential
from keras.layers import Dense, Activation

In [18]:
dims = X_train.shape[1]
print(dims, 'dims')
print("Building model...")

nb_classes = Y_train.shape[1]
print(nb_classes, 'classes')

model = Sequential()
model.add(Dense(nb_classes, input_shape=(dims,), activation='sigmoid'))
model.add(Activation('softmax'))

model.compile(optimizer='sgd', loss='categorical_crossentropy')
model.fit(X_train, Y_train)


93 dims
Building model...
9 classes
Epoch 1/10
61878/61878 [==============================] - 3s - loss: 1.9845     
Epoch 2/10
61878/61878 [==============================] - 2s - loss: 1.8337     
Epoch 3/10
61878/61878 [==============================] - 2s - loss: 1.7779     
Epoch 4/10
61878/61878 [==============================] - 3s - loss: 1.7432     
Epoch 5/10
61878/61878 [==============================] - 2s - loss: 1.7187     
Epoch 6/10
61878/61878 [==============================] - 3s - loss: 1.7002     
Epoch 7/10
61878/61878 [==============================] - 2s - loss: 1.6857     
Epoch 8/10
61878/61878 [==============================] - 2s - loss: 1.6739     
Epoch 9/10
61878/61878 [==============================] - 2s - loss: 1.6642     
Epoch 10/10
61878/61878 [==============================] - 2s - loss: 1.6560     
Out[18]:
<keras.callbacks.History at 0x123026dd8>

Simplicity is pretty impressive, right? :)

For image data, the two backends use different default dimension orderings:

Theano:

shape = (channels, rows, cols), i.e. image_data_format = "channels_first"

TensorFlow:

shape = (rows, cols, channels), i.e. image_data_format = "channels_last"

image_data_format: channels_last | channels_first
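As a quick check (a minimal sketch assuming the standard Keras 2 backend functions image_data_format and set_image_data_format), you can query and change this setting at runtime instead of editing keras.json:

from keras import backend as K

# Query the current image dimension ordering: 'channels_last' or 'channels_first'
print(K.image_data_format())

# Override the ordering for the current session only; keras.json is not modified
K.set_image_data_format('channels_first')
print(K.image_data_format())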


In [19]:
!cat ~/.keras/keras.json


{
	"epsilon": 1e-07,
	"backend": "tensorflow",
	"floatx": "float32",
	"image_data_format": "channels_last"
}

Now let's understand:

The core data structure of Keras is a model, a way to organize layers. The main type of model is the Sequential model, a linear stack of layers.

What we did here was stack a fully connected (Dense) layer of trainable weights from the input to the output, plus an Activation layer on top of it.

Dense
from keras.layers.core import Dense

Dense(units, activation=None, use_bias=True, 
      kernel_initializer='glorot_uniform', bias_initializer='zeros', 
      kernel_regularizer=None, bias_regularizer=None, 
      activity_regularizer=None, kernel_constraint=None, bias_constraint=None)
  • units: positive integer, dimensionality of the output space.

  • activation: name of the activation function to use (see activations), or alternatively an element-wise function. If you don't specify anything, no activation is applied (i.e. "linear" activation: a(x) = x).

  • use_bias: whether the layer uses a bias vector (i.e. whether it is affine rather than purely linear).

  • kernel_initializer: initializer for the kernel weights matrix (see initializers).

  • bias_initializer: initializer for the bias vector.

  • kernel_regularizer: regularizer (e.g. L1 or L2 regularization) applied to the kernel weights matrix.

  • bias_regularizer: regularizer applied to the bias vector.

  • activity_regularizer: regularizer applied to the output of the layer (its "activation").

  • kernel_constraint: constraint from the constraints module (e.g. max_norm, non_neg) applied to the kernel weights matrix.

  • bias_constraint: constraint from the constraints module, applied to the bias vector.

  • weights (legacy keyword, not shown in the signature above): list of Numpy arrays to set as initial weights. The list should have 2 elements, of shape (input_dim, units) and (units,) for the kernel and bias respectively.

A short usage example follows this list.

(some) other keras.layers.core layers

  • keras.layers.core.Flatten()
  • keras.layers.core.Reshape(target_shape)
  • keras.layers.core.Permute(dims)
model = Sequential()
model.add(Permute((2, 1), input_shape=(10, 64)))
# now: model.output_shape == (None, 64, 10)
# note: `None` is the batch dimension
  • keras.layers.core.Lambda(function, output_shape=None, arguments=None) (see the sketch below)
  • keras.layers.core.ActivityRegularization(l1=0.0, l2=0.0)
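Similarly, a small hypothetical sketch of Reshape and Lambda (not used elsewhere in this notebook):

from keras.models import Sequential
from keras.layers import Reshape, Lambda

model = Sequential()
model.add(Reshape((3, 4), input_shape=(12,)))  # (None, 12) -> (None, 3, 4)
model.add(Lambda(lambda t: t * 2))             # apply an arbitrary element-wise function
# now: model.output_shape == (None, 3, 4)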

Credits: Yam Peleg (@Yampeleg)

Activation
from keras.layers.core import Activation

Activation(activation)

Supported Activations : [https://keras.io/activations/]

Advanced Activations: [https://keras.io/layers/advanced-activations/]
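As a hypothetical illustration, an activation can be passed as a layer argument, added as a standalone Activation layer, or, for parametrized variants such as LeakyReLU, added as an advanced-activation layer:

from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers.advanced_activations import LeakyReLU

model = Sequential()
model.add(Dense(64, input_shape=(93,), activation='relu'))  # activation as an argument
model.add(Dense(64))
model.add(Activation('tanh'))                               # activation as a separate layer
model.add(Dense(64))
model.add(LeakyReLU(alpha=0.1))                             # advanced activation layer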

Optimizer

If you need to, you can further configure your optimizer. A core principle of Keras is to make things reasonably simple, while allowing the user to be fully in control when they need to (the ultimate control being the easy extensibility of the source code). Here we used SGD (stochastic gradient descent) as an optimization algorithm for our trainable weights.
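For instance (a minimal sketch using the standard keras.optimizers API), the learning rate and momentum of SGD can be set explicitly instead of passing the 'sgd' string:

from keras.optimizers import SGD

# Explicitly configured SGD instead of the 'sgd' string default
sgd = SGD(lr=0.01, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd, loss='categorical_crossentropy')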

"Data Sciencing" this example a little bit more

What we did here is nice; however, in the real world it is not usable because of overfitting. Let's try to address it with a held-out validation split.

Overfitting

In overfitting, a statistical model describes random error or noise instead of the underlying relationship. Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.

A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data.

To avoid overfitting, we will first split our data into a training set and a validation set, and evaluate the model on the validation set.
Next, we will use two of Keras's callbacks: EarlyStopping and ModelCheckpoint.

Let's first look at the model we implemented.


In [20]:
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 9)                 846       
_________________________________________________________________
activation_1 (Activation)    (None, 9)                 0         
=================================================================
Total params: 846
Trainable params: 846
Non-trainable params: 0
_________________________________________________________________

In [21]:
from sklearn.model_selection import train_test_split
from keras.callbacks import EarlyStopping, ModelCheckpoint

In [22]:
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.15, random_state=42)

fBestModel = 'best_model.h5' 
early_stop = EarlyStopping(monitor='val_loss', patience=2, verbose=1) 
best_model = ModelCheckpoint(fBestModel, verbose=0, save_best_only=True)

model.fit(X_train, Y_train, validation_data = (X_val, Y_val), epochs=50, 
          batch_size=128, verbose=True, callbacks=[best_model, early_stop])


Train on 52596 samples, validate on 9282 samples
Epoch 1/50
52596/52596 [==============================] - 1s - loss: 1.6516 - val_loss: 1.6513
Epoch 2/50
52596/52596 [==============================] - 0s - loss: 1.6501 - val_loss: 1.6499
Epoch 3/50
52596/52596 [==============================] - 1s - loss: 1.6488 - val_loss: 1.6486
Epoch 4/50
52596/52596 [==============================] - 1s - loss: 1.6474 - val_loss: 1.6473
Epoch 5/50
52596/52596 [==============================] - 0s - loss: 1.6462 - val_loss: 1.6461
Epoch 6/50
52596/52596 [==============================] - 0s - loss: 1.6449 - val_loss: 1.6448
Epoch 7/50
52596/52596 [==============================] - 0s - loss: 1.6437 - val_loss: 1.6437
Epoch 8/50
52596/52596 [==============================] - 0s - loss: 1.6425 - val_loss: 1.6425
Epoch 9/50
52596/52596 [==============================] - 0s - loss: 1.6414 - val_loss: 1.6414
Epoch 10/50
52596/52596 [==============================] - 0s - loss: 1.6403 - val_loss: 1.6403
Epoch 11/50
52596/52596 [==============================] - 0s - loss: 1.6392 - val_loss: 1.6393
Epoch 12/50
52596/52596 [==============================] - 0s - loss: 1.6382 - val_loss: 1.6383
Epoch 13/50
52596/52596 [==============================] - 1s - loss: 1.6372 - val_loss: 1.6373
Epoch 14/50
52596/52596 [==============================] - 0s - loss: 1.6362 - val_loss: 1.6363
Epoch 15/50
52596/52596 [==============================] - 0s - loss: 1.6352 - val_loss: 1.6354
Epoch 16/50
52596/52596 [==============================] - 0s - loss: 1.6343 - val_loss: 1.6345
Epoch 17/50
52596/52596 [==============================] - 0s - loss: 1.6334 - val_loss: 1.6336
Epoch 18/50
52596/52596 [==============================] - 0s - loss: 1.6325 - val_loss: 1.6327
Epoch 19/50
52596/52596 [==============================] - 0s - loss: 1.6316 - val_loss: 1.6319
Epoch 20/50
52596/52596 [==============================] - 0s - loss: 1.6308 - val_loss: 1.6311
Epoch 21/50
52596/52596 [==============================] - 0s - loss: 1.6300 - val_loss: 1.6303
Epoch 22/50
52596/52596 [==============================] - 0s - loss: 1.6292 - val_loss: 1.6295
Epoch 23/50
52596/52596 [==============================] - 0s - loss: 1.6284 - val_loss: 1.6287
Epoch 24/50
52596/52596 [==============================] - 0s - loss: 1.6276 - val_loss: 1.6280
Epoch 25/50
52596/52596 [==============================] - 0s - loss: 1.6269 - val_loss: 1.6273
Epoch 26/50
52596/52596 [==============================] - 0s - loss: 1.6262 - val_loss: 1.6265
Epoch 27/50
52596/52596 [==============================] - 0s - loss: 1.6254 - val_loss: 1.6258
Epoch 28/50
52596/52596 [==============================] - 0s - loss: 1.6247 - val_loss: 1.6252
Epoch 29/50
52596/52596 [==============================] - 0s - loss: 1.6241 - val_loss: 1.6245
Epoch 30/50
52596/52596 [==============================] - 0s - loss: 1.6234 - val_loss: 1.6238
Epoch 31/50
52596/52596 [==============================] - 0s - loss: 1.6227 - val_loss: 1.6232
Epoch 32/50
52596/52596 [==============================] - 0s - loss: 1.6221 - val_loss: 1.6226
Epoch 33/50
52596/52596 [==============================] - 0s - loss: 1.6215 - val_loss: 1.6220
Epoch 34/50
52596/52596 [==============================] - 1s - loss: 1.6209 - val_loss: 1.6214
Epoch 35/50
52596/52596 [==============================] - 0s - loss: 1.6203 - val_loss: 1.6208
Epoch 36/50
52596/52596 [==============================] - 0s - loss: 1.6197 - val_loss: 1.6202
Epoch 37/50
52596/52596 [==============================] - 0s - loss: 1.6191 - val_loss: 1.6197
Epoch 38/50
52596/52596 [==============================] - 0s - loss: 1.6186 - val_loss: 1.6191
Epoch 39/50
52596/52596 [==============================] - 0s - loss: 1.6180 - val_loss: 1.6186
Epoch 40/50
52596/52596 [==============================] - 0s - loss: 1.6175 - val_loss: 1.6181
Epoch 41/50
52596/52596 [==============================] - 0s - loss: 1.6170 - val_loss: 1.6175
Epoch 42/50
52596/52596 [==============================] - 0s - loss: 1.6165 - val_loss: 1.6170
Epoch 43/50
52596/52596 [==============================] - 0s - loss: 1.6160 - val_loss: 1.6166
Epoch 44/50
52596/52596 [==============================] - 0s - loss: 1.6155 - val_loss: 1.6161
Epoch 45/50
52596/52596 [==============================] - 0s - loss: 1.6150 - val_loss: 1.6156
Epoch 46/50
52596/52596 [==============================] - 0s - loss: 1.6145 - val_loss: 1.6151
Epoch 47/50
52596/52596 [==============================] - 0s - loss: 1.6141 - val_loss: 1.6147
Epoch 48/50
52596/52596 [==============================] - 0s - loss: 1.6136 - val_loss: 1.6142
Epoch 49/50
52596/52596 [==============================] - 0s - loss: 1.6132 - val_loss: 1.6138
Epoch 50/50
52596/52596 [==============================] - 0s - loss: 1.6127 - val_loss: 1.6134
Out[22]:
<keras.callbacks.History at 0x11e7a2710>

Multi-Layer Fully Connected Networks

Forward and Backward Propagation

Q: How hard can it be to build a Multi-Layer Fully-Connected Network with Keras?

A: It is basically the same, just add more layers!


In [23]:
model = Sequential()
model.add(Dense(100, input_shape=(dims,)))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy')
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_2 (Dense)              (None, 100)               9400      
_________________________________________________________________
dense_3 (Dense)              (None, 9)                 909       
_________________________________________________________________
activation_2 (Activation)    (None, 9)                 0         
=================================================================
Total params: 10,309
Trainable params: 10,309
Non-trainable params: 0
_________________________________________________________________

In [24]:
model.fit(X_train, Y_train, validation_data = (X_val, Y_val), epochs=20, 
          batch_size=128, verbose=True)


Train on 52596 samples, validate on 9282 samples
Epoch 1/20
52596/52596 [==============================] - 1s - loss: 1.2113 - val_loss: 0.8824
Epoch 2/20
52596/52596 [==============================] - 0s - loss: 0.8229 - val_loss: 0.7851
Epoch 3/20
52596/52596 [==============================] - 0s - loss: 0.7623 - val_loss: 0.7470
Epoch 4/20
52596/52596 [==============================] - 1s - loss: 0.7329 - val_loss: 0.7258
Epoch 5/20
52596/52596 [==============================] - 1s - loss: 0.7143 - val_loss: 0.7107
Epoch 6/20
52596/52596 [==============================] - 0s - loss: 0.7014 - val_loss: 0.7005
Epoch 7/20
52596/52596 [==============================] - 1s - loss: 0.6918 - val_loss: 0.6922
Epoch 8/20
52596/52596 [==============================] - 0s - loss: 0.6843 - val_loss: 0.6868
Epoch 9/20
52596/52596 [==============================] - 0s - loss: 0.6784 - val_loss: 0.6817
Epoch 10/20
52596/52596 [==============================] - 0s - loss: 0.6736 - val_loss: 0.6773
Epoch 11/20
52596/52596 [==============================] - 0s - loss: 0.6695 - val_loss: 0.6739
Epoch 12/20
52596/52596 [==============================] - 1s - loss: 0.6660 - val_loss: 0.6711
Epoch 13/20
52596/52596 [==============================] - 1s - loss: 0.6631 - val_loss: 0.6688
Epoch 14/20
52596/52596 [==============================] - 1s - loss: 0.6604 - val_loss: 0.6670
Epoch 15/20
52596/52596 [==============================] - 1s - loss: 0.6582 - val_loss: 0.6649
Epoch 16/20
52596/52596 [==============================] - 1s - loss: 0.6563 - val_loss: 0.6626
Epoch 17/20
52596/52596 [==============================] - 1s - loss: 0.6545 - val_loss: 0.6611
Epoch 18/20
52596/52596 [==============================] - 1s - loss: 0.6528 - val_loss: 0.6598
Epoch 19/20
52596/52596 [==============================] - 1s - loss: 0.6514 - val_loss: 0.6578
Epoch 20/20
52596/52596 [==============================] - 1s - loss: 0.6500 - val_loss: 0.6571
Out[24]:
<keras.callbacks.History at 0x12830b978>

Your Turn!

Hands On - Keras Fully Connected

Take a couple of minutes and try to play with the number of layers and the number of parameters in each layer to get the best results.


In [25]:
model = Sequential()
model.add(Dense(100, input_shape=(dims,)))

# ...
# ...
# Play with it! Add as many layers as you want and try to get better results.

model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy')

model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_4 (Dense)              (None, 100)               9400      
_________________________________________________________________
dense_5 (Dense)              (None, 9)                 909       
_________________________________________________________________
activation_3 (Activation)    (None, 9)                 0         
=================================================================
Total params: 10,309
Trainable params: 10,309
Non-trainable params: 0
_________________________________________________________________

In [26]:
model.fit(X_train, Y_train, validation_data = (X_val, Y_val), epochs=20, 
          batch_size=128, verbose=True)


Train on 52596 samples, validate on 9282 samples
Epoch 1/20
52596/52596 [==============================] - 1s - loss: 1.2107 - val_loss: 0.8821
Epoch 2/20
52596/52596 [==============================] - 1s - loss: 0.8204 - val_loss: 0.7798
Epoch 3/20
52596/52596 [==============================] - 1s - loss: 0.7577 - val_loss: 0.7393
Epoch 4/20
52596/52596 [==============================] - 0s - loss: 0.7280 - val_loss: 0.7176
Epoch 5/20
52596/52596 [==============================] - 1s - loss: 0.7097 - val_loss: 0.7028
Epoch 6/20
52596/52596 [==============================] - 1s - loss: 0.6973 - val_loss: 0.6929
Epoch 7/20
52596/52596 [==============================] - 1s - loss: 0.6883 - val_loss: 0.6858
Epoch 8/20
52596/52596 [==============================] - 1s - loss: 0.6813 - val_loss: 0.6804
Epoch 9/20
52596/52596 [==============================] - 1s - loss: 0.6757 - val_loss: 0.6756
Epoch 10/20
52596/52596 [==============================] - 1s - loss: 0.6711 - val_loss: 0.6722
Epoch 11/20
52596/52596 [==============================] - 1s - loss: 0.6672 - val_loss: 0.6692
Epoch 12/20
52596/52596 [==============================] - 0s - loss: 0.6641 - val_loss: 0.6667
Epoch 13/20
52596/52596 [==============================] - 0s - loss: 0.6613 - val_loss: 0.6636
Epoch 14/20
52596/52596 [==============================] - 0s - loss: 0.6589 - val_loss: 0.6620
Epoch 15/20
52596/52596 [==============================] - 0s - loss: 0.6568 - val_loss: 0.6606
Epoch 16/20
52596/52596 [==============================] - 0s - loss: 0.6546 - val_loss: 0.6589
Epoch 17/20
52596/52596 [==============================] - 0s - loss: 0.6531 - val_loss: 0.6577
Epoch 18/20
52596/52596 [==============================] - 0s - loss: 0.6515 - val_loss: 0.6568
Epoch 19/20
52596/52596 [==============================] - 0s - loss: 0.6501 - val_loss: 0.6546
Epoch 20/20
52596/52596 [==============================] - 0s - loss: 0.6489 - val_loss: 0.6539
Out[26]:
<keras.callbacks.History at 0x1285bae80>

Building a question answering system, an image classification model, a Neural Turing Machine, a word2vec embedder or any other model is just as fast. The ideas behind deep learning are simple, so why should their implementation be painful?

Theoretical Motivations for depth

Much has been studied about the depth of neural nets. It has been proven mathematically[1] and empirically that convolutional neural networks benefit from depth!

[1] - On the Expressive Power of Deep Learning: A Tensor Analysis - Cohen, et al 2015


One much-quoted theorem about neural networks states that:

The universal approximation theorem[1] states that a feed-forward network with a single hidden layer containing a finite number of neurons (i.e., a multilayer perceptron) can approximate continuous functions on compact subsets of $\mathbb{R}^n$, under mild assumptions on the activation function. The theorem thus states that simple neural networks can represent a wide variety of interesting functions when given appropriate parameters; however, it does not touch upon the algorithmic learnability of those parameters.
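In symbols (a standard formulation of the theorem, stated here for concreteness): for any continuous $f$ on a compact set $K \subset \mathbb{R}^n$, any $\varepsilon > 0$, and a suitable (e.g. non-constant, bounded, continuous) activation $\sigma$, there exist $N$, weights $w_i \in \mathbb{R}^n$, biases $b_i$ and coefficients $v_i$ such that

$$F(x) = \sum_{i=1}^{N} v_i \, \sigma(w_i^\top x + b_i) \quad \text{satisfies} \quad \sup_{x \in K} |F(x) - f(x)| < \varepsilon.$$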

[1] - Approximation Capabilities of Multilayer Feedforward Networks - Kurt Hornik 1991